Welcome to our final project! We are Macalester College students (classes of 2021 and 2022) from the Department of Mathematics, Statistics, and Computer Science. We took Advanced Data Science in R (STAT 494) during the spring 2021 semester, and this is our final project for the course.

Introduction

Computer science is growing rapidly in the United States and around the world. Industry releases new advances constantly, and technology is becoming more ingrained in our daily lives. The increasing demand for computer scientists has made the occupation more popular. To meet that demand, educational institutions and systems are increasing the number of courses they offer in order to train more future computer scientists. This development started at the college level, where majoring in computer science is becoming a widely available option. At Macalester College, it is one of the largest departments for both students and faculty.

While the availability of courses at the college level is an amazing start, there is a big push to offer computer science courses in K-12 education. Offering computer science courses in elementary and secondary schools gives kids an opportunity to be exposed to coding. This could lead younger students to discover new interests and engage with computer science earlier. Often, being exposed to computer science at a younger age makes students more comfortable with the material and the field later on, which can lead to a more empowered and diverse set of students entering the workforce or higher education. Given the importance of having computer science courses available in K-12 education, we decided to explore their availability in K-12 school districts in Minnesota. For our Advanced Data Science final project, we explore the connections among a variety of datasets related to this topic, including K-12 computer science course availability in Minnesota, demographic information from the U.S. census, ACT scores, and funding.

Computer Science K-12 Course Availability in Minnesota

To begin, let’s look at how widely computer science courses are already available in K-12 education in Minnesota. The two plots below show the state's public school districts along with the number and variety of computer science courses offered in each district.

These plotly maps are interactive and present important information both visually and textually. Hover over a district and a text box appears with the relevant details, which makes it easy to compare districts.

Demographics, ACT Scores, and Finances

Because public schools are funded by property taxes, course availability is usually an intersectional issue that depends on other factors. We hypothesized that course availability would be correlated with each district's overall wealth and access to resources. In this section, we explore some of the variables we expected to be significant in relation to course availability.


Connections

Now that we have introduced our datasets, we show how they connect and examine whether course availability is correlated with demographic variables, ACT scores, and funding.

Both maps include the district name and population, along with either the total number of computer science courses offered or the number of computer science course categories offered. The first map additionally includes the median household income, the percentage of residents who are White, and the average ACT score, whereas the second map includes the total revenue and total spending per pupil. These maps also use plotly, so you can hover over a district to view its values.


Modeling Computer Science K-12 Course Availability in Minnesota

To understand which factors have the largest influence on course availability, we created two models to predict the number of computer science courses per district in Minnesota. The first model was LASSO, a penalized linear regression method that shrinks coefficients toward zero and sets some of them exactly to zero, which removes weak predictors from the model. With over 80 possible predictors, it would be difficult to select variables by hand for ordinary least squares, and including everything would lead to overfitting. The second model was a random forest, which consists of a large number of decision trees and averages the predictions of those trees.
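To make the "shrinks to zero" behavior concrete, here is a minimal sketch of the soft-thresholding rule. It gives the LASSO solution only in the special case where the predictors are orthonormal and the objective is half the residual sum of squares plus the penalty times the sum of absolute coefficients. The project itself was built in R; this Python snippet is an illustration rather than the code we ran, and the predictor names, coefficient values, and penalty are all made up.

```python
def soft_threshold(ols_coefficient, penalty):
    """Shrink one least-squares coefficient toward zero by `penalty`.

    Coefficients whose magnitude is at most `penalty` become exactly zero,
    which is how LASSO drops weak predictors (orthonormal-predictor case).
    """
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    if ols_coefficient > penalty:
        return ols_coefficient - penalty
    if ols_coefficient < -penalty:
        return ols_coefficient + penalty
    return 0.0


# Hypothetical least-squares coefficients; not estimates from the project.
ols_coefficients = {
    "log_total_revenue": 2.0,
    "pct_internet_households": -1.0,
    "district_land_area": 0.25,
}

penalty = 0.5  # assumed tuning value, normally chosen by cross-validation
shrunk = {
    name: soft_threshold(value, penalty)
    for name, value in ols_coefficients.items()
}

for name, value in shrunk.items():
    print(f"{name}: {value}")
# log_total_revenue: 1.5
# pct_internet_households: -0.5
# district_land_area: 0.0
```

With correlated predictors, as in real data, the solution no longer has this closed form, but the qualitative effect is the same: a larger penalty pushes more coefficients to exactly zero.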

Before fitting the models, the main transformation we had to perform was a log transformation of many of the variables from the Annual Survey of School System Finances. These were raw tallies of revenue or expenditure, so the data were right-skewed, with a few districts having much higher values than the majority. Based on the root mean squared error (RMSE), the random forest greatly outperformed the LASSO, with an RMSE of approximately 1.86 compared to the LASSO’s 4.11.
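The two ideas in this paragraph can be sketched in a few lines. The Python snippet below is illustrative only (our analysis was done in R): the revenue figures are invented, and the mean-to-median ratio is just a quick stand-in for a proper skewness check. It shows how taking logs pulls in a single very large value, and how RMSE is computed from actual and predicted values.

```python
import math
import statistics

# Hypothetical revenue totals in dollars for six districts; not project data.
revenue = [1.2e6, 1.5e6, 2.0e6, 2.4e6, 3.1e6, 9.5e7]


def mean_median_ratio(values):
    """Ratio of mean to median; values far above 1 suggest right skew."""
    return statistics.mean(values) / statistics.median(values)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Natural log of each value; assumes every value is strictly positive.
log_revenue = [math.log(value) for value in revenue]

raw_ratio = mean_median_ratio(revenue)
log_ratio = mean_median_ratio(log_revenue)
print(f"mean/median before log: {raw_ratio:.2f}")
print(f"mean/median after log:  {log_ratio:.2f}")

# RMSE on made-up course counts: errors of 1, 0, and 2 courses.
print(rmse([3, 5, 8], [4, 5, 6]))
```

Because RMSE is in the same units as the outcome, an RMSE of 1.86 means the random forest's predictions were typically off by roughly two courses per district.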

In a random forest model, some variables will have higher predictive power and contribute more to the outcome. Below is a plot ranking our predictors in terms of their importance:

Each bar shows how much the RMSE would change if the corresponding variable were permuted. If permuting a variable increases the RMSE substantially more than permuting other variables does, that variable is important to the model. Here, the RMSE increases the most when revenue from the Child Nutrition Act, spending on instructional staff, and total expenditure are permuted. The highest-ranking variables all came from the School Survey. The three most important demographic variables from the ACS are, for each district, the percent of the total population who are Black alone, the percent of households with an Internet subscription, and the percent of households receiving SSI, public assistance, or food stamps. The variables at the bottom, which show no change in RMSE when permuted, were excluded from the modeling from the beginning because they are ID variables or raw demographic counts (for the latter, we used their percentage versions instead).
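The permutation idea described above can be sketched as follows. This is a generic Python illustration under stated assumptions, not the R code behind our plot: `toy_model` is a hypothetical stand-in for a fitted random forest, the data are invented, and the function simply reports the average increase in RMSE after shuffling each column.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def permutation_importance(predict, rows, target, n_repeats=10, seed=0):
    """Mean increase in RMSE when each column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = rmse(target, [predict(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column, leaving every other column untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted_rows = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            permuted_error = rmse(target, [predict(row) for row in permuted_rows])
            increases.append(permuted_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model: uses column 0 and ignores column 1."""
    return 2.0 * row[0]


# Invented data: the target depends only on the first column.
rows = [[float(i), float((i * 3) % 8)] for i in range(8)]
target = [2.0 * row[0] for row in rows]

importances = permutation_importance(toy_model, rows, target)
print(importances)
```

In this toy setup, shuffling the first column breaks its link to the target and raises the RMSE, while shuffling the second column changes nothing because the model never uses it. That is the same pattern as the zero-length bars at the bottom of our plot.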

Implications

The first thing that is critical to highlight is that correlation does not imply causation. Although this project looks at connections between various datasets and different variables, we are not suggesting that any of our predictors directly alter computer science course availability. It is possible, given the work that we have done, but without an experiment we cannot be certain about causation.

Previous work has examined how disparities in education relate to many of the variables we displayed in our project, such as household income and race. It is an intersectional issue that combines many layers of societal disparities. For instance, research suggests that ACT and other standardized test scores reveal more about family wealth and privilege than about actual intelligence or likelihood of success. It is therefore unsurprising that the districts with high average ACT scores also tend to have high median household incomes. Because computer science is a newer field, less work has been done on this subject specifically. The rise in availability in recent years has also been focused more on college and graduate school, so K-12 education has received less attention.

That being said, there are many nuances to this issue that we could not address within the scope of our project. One variable we looked at was race, and while the connection between race and educational disparities has been studied, it can be difficult to see in some of our work. We hypothesize that there are a few reasons for this. First, Minnesota's population is predominantly White. Furthermore, the area with the most diversity, near the Twin Cities, is also an area with considerable inequality. Without this context, it may seem as though there is a correlation between greater diversity, higher median household income, and computer science course availability. However, we cannot make this claim without further investigating how the inequalities within each district play a role.

In addition, the population size of the districts could affect the outcomes. A district can encompass many schools, and the data may vary considerably within a single district. Future work might investigate a smaller region to explore some of these nuances and better understand the connection between computer science course availability in K-12 education and our other variables.

For more information about how we created this project, please visit:

GitHub: https://github.com/anaelkuperwajs/STAT494-Final-Project

Behind the scenes: https://github.com/anaelkuperwajs/STAT494-Final-Project/blob/main/behind_the_scenes.Rmd